Wisteria

Topic: Life Expectancy and Happiness Indicators

Team members:

Part 1: Project Proposal

The Project Proposal covers the following sections:-

  1. Initial Question
  2. Overview and Motivation
  3. Related Work
  4. Data Cleaning & Data Pre-processing

1. Initial Question

What is the relationship between happiness indicators and life expectancy from a global perspective?

2. Overview and Motivation

In the Malaysian community, there is an ongoing debate about whether money can buy happiness. While most people agree that money cannot buy a happy life from a sentimental point of view, it is undeniable that people can expect to live a quality life with money. Moreover, a higher quality of life generally indicates greater happiness linked to improved health, leading to higher life expectancy (Lozano and Sole-Auro, 2021). This motivated the Wisteria Team to explore how happiness determinants correlate with one's life expectancy.

The Wisteria Team will be exploring how the happiness indicators (independent variables) provided in the World Happiness Report 2022 correlate with life expectancy (dependent variable). The happiness indicators refer to scores used to calculate happiness rankings across countries, whereas life expectancy is defined as "the number of years a person can expect to live" (Murillo, 2016).

The World Happiness Report provides data for both happiness indicators and life expectancy across different countries.

The happiness indicators and their meanings:

  1. Log GDP per capita: Measurement of the economic output of a nation per person (Investopedia).
  2. Social support: National average of the binary responses (1 = YES, 0 = NO) to the GWP (Gallup World Poll) question: "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"
  3. Healthy life expectancy at birth: An estimate of the average number of years babies born this year would live in a state of good general health, if mortality levels and good-health levels at each age remained constant in the future (Government of the UK).
  4. Freedom to make life choices: National average of the binary responses (1 = YES, 0 = NO) to the GWP question: "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"
  5. Generosity: Residual of regressing the national average of GWP responses to the donation question "Have you donated money to a charity in the past month?" on log GDP per capita.
  6. Perceptions of corruption: National average of the binary responses (1 = YES, 0 = NO) to the two GWP questions: "Is corruption widespread throughout the government in this country or not?" and "Is corruption widespread within businesses in this country or not?"
  7. Positive affect: Average of previous-day affect measures for laughter, enjoyment, and doing or learning something interesting, from a series of affect questions.
  8. Negative affect: Average of previous-day affect measures for worry, sadness, and anger, from a series of affect questions.
  9. Life ladder: Happiness score determined by the national average of responses to the life evaluation questions.
  10. Age dependency ratio: Ratio of the dependent population to the working population, which indicates financial stress level.

3. Related Work

From a data-informed point of view, the vast amount of information in the Report and the motivation mentioned earlier inspired the Wisteria Team to conduct a small study on the relationship between happiness indicators and life expectancy.

Upon starting the data understanding stage, the Wisteria Team decided to scan through the Report and conduct further reading. The Report stated that the variables (happiness indicators) were taken from the Gallup World Poll surveys from 2019 to 2021. In other words, these variables originated from the answers participants provided to a series of life evaluation questions in the survey. The responses were then translated into scores (categorical to continuous data), allowing the team to carry out a quantitative study.

From a sociological perspective, the Wisteria Team skimmed through a book read by millions worldwide, "The Top Five Regrets of the Dying: A Life Transformed by the Dearly Departing". In this book, the dying expressed their deepest regrets at the end of their lives. The team collected these regrets to better understand the implicit link between happiness, regrets and life expectancy. Furthermore, the team utilised this chance in the hope of spreading awareness in the Malaysian community of the significance of being happy.

The top 5 regrets of the dying are:

  1. "I wish I'd had the courage to live a life true to myself, not the life others expected of me."
  2. "I wish I hadn't worked so hard."
  3. "I wish I'd had the courage to express my feelings."
  4. "I wish I had stayed in touch with my friends."
  5. "I wish that I had let myself be happier."

Being happy is a choice.

Lozano, M., & Sole-Auro, A. (2021). Happiness and life expectancy by main occupational position among older workers: Who will live longer and happy? SSM – Population Health 13.

Murillo, I.L. (2016). The life expectancy: what is it and why does it matter. Cenie.

Ware, B. (2012). The Top Five Regrets of the Dying: A Life Transformed by the Dearly Departing. Hay House.

World Happiness Report 2022

4. Data Cleaning & Data Pre-Processing

Firstly, import pandas and sklearn libraries.

Get CSV from our file directory.
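As a self-contained illustration of the loading step (the real CSV file names and columns in the project directory may differ, so a StringIO stand-in is used here):

```python
import io
import pandas as pd

# In the notebook this would be pd.read_csv("<file>.csv") from the working
# directory; the column names below are illustrative, not the real schema.
happiness_csv = io.StringIO(
    "Country name,year,Life Ladder\nFinland,2020,7.842\nMalaysia,2020,6.012\n"
)
df_happiness = pd.read_csv(happiness_csv)
print(df_happiness.shape)  # (2, 3)
```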

First and foremost, df_happiness was our baseline dataframe. We were going to integrate df_age_ratio into the df_happiness dataframe.

The strategy was to pick the exact value in df_age_ratio based on country and year.

Row : country

Column: year

However, there were a few countries that could not be found. We looked into it and discerned 3 possible scenarios which might have caused this issue:

  1. The country name changed after the report was published (inconsistency);
  2. The same country under a different representation (naming convention);
  3. The country does not exist in df_age_ratio.

For scenarios 1 and 2, we would create a dict to transform the names of the undetected countries so that they match the country names in df_happiness. For scenario 3, we would insert a "None" value.
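A minimal sketch of this lookup strategy, using toy country names, years and values rather than the real frames:

```python
import pandas as pd

# Toy df_age_ratio laid out as described: rows = country, columns = year.
df_age_ratio = pd.DataFrame(
    {2019: [52.1, 39.8], 2020: [51.7, 39.5]},
    index=["United States", "South Korea"],
)

# Scenarios 1 and 2: a dict translating one naming convention to the other
# (these mappings are illustrative, not the project's actual dict).
name_map = {
    "United States of America": "United States",
    "Korea, Rep.": "South Korea",
}

def lookup_age_ratio(country, year):
    country = name_map.get(country, country)
    if country not in df_age_ratio.index:
        return None  # scenario 3: country absent from df_age_ratio
    return df_age_ratio.loc[country, year]

print(lookup_age_ratio("Korea, Rep.", 2020))  # 39.5
print(lookup_age_ratio("Atlantis", 2020))     # None
```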

The next integration step was to update df_happiness up to year 2021. From the data source, we found that there were two Excel files: the first consisted of data for year 2021, whilst the second covered year 2000 to year 2020. Therefore, we updated our dataset by integrating df_happiness_2021 into df_happiness.

Here, we used the inner join (intersection) concept to combine df_happiness_2021 and df_happiness by matching the columns existing in df_happiness. In order to do this, we needed to rename some columns and add new columns in df_happiness_2021, so that the columns matched.

*A left join might also be possible
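The rename-then-inner-join step can be sketched with pandas on toy frames (column names and values here are illustrative; the real frames have many more columns):

```python
import pandas as pd

df_happiness = pd.DataFrame({
    "Country name": ["Finland"],
    "year": [2020],
    "Life Ladder": [7.89],
    "age_ratio": [60.1],
})
df_happiness_2021 = pd.DataFrame({
    "Country": ["Finland"],
    "Ladder score": [7.84],
})

# Rename columns in df_happiness_2021 to match df_happiness,
# and add the missing 'year' column.
df_2021 = df_happiness_2021.rename(
    columns={"Country": "Country name", "Ladder score": "Life Ladder"}
)
df_2021["year"] = 2021

# join="inner" keeps only the columns common to both frames,
# so 'age_ratio' (absent from the 2021 frame) is dropped here.
combined = pd.concat([df_happiness, df_2021], join="inner", ignore_index=True)
print(len(combined), sorted(combined.columns))
```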

We completed our dataset integration upon successful creation of the hapi_age_latest.csv file.

Next, we proceeded to the subsequent data preprocessing tasks, which focus on data cleaning and data reduction.

Step 1: Remove Duplicates

We started by importing the libraries needed, and creating a copy of sort_df, named df.

We used the drop_duplicates() function to return a dataframe with duplicate rows removed. The result still showed all 2098 rows, indicating the dataset did not contain duplicate rows. In this case, we proceeded to changing the data type prior to outlier removal.
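The duplicate check can be sketched on toy rows (in our run, the row count stayed at 2098, i.e. nothing was dropped):

```python
import pandas as pd

df = pd.DataFrame({
    "Country name": ["Finland", "Finland", "Malaysia"],
    "year": [2020, 2020, 2020],       # first two rows are exact duplicates
    "Life Ladder": [7.89, 7.89, 6.01],
})
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2
```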

Step 2: Change of Data Type
We used df.info() to check the concise summary for our data frame.

It was expected for the 'Country name' attribute to have the 'object' data type. Thus, no data type change was needed.
'Positive affect', 'Negative affect' and 'age_ratio' would require further checking to see if there were any abnormal values present.
So, we utilised the value_counts() function to get the counts of unique values in these attributes.

As we can see, 'age_ratio' contained non-numeric data.
In order to investigate further, we proceeded to retrieve the rows of data whose 'age_ratio' was the string 'None'.

We noticed that there were multiple rows with 'None' values in 'age_ratio'.
With the help of np.unique(), we identified that 'Palestinian Territories' and 'Taiwan Province of China' had this data issue.

Since we had 2098 rows of preprocessed data, we decided to remove the records of 'Palestinian Territories' and 'Taiwan Province of China', in view that these countries did not contain any 'age_ratio' reference values across the years.

Next, we proceeded to change the data type of 'Positive affect', 'Negative affect' and 'age_ratio' to a numeric data type.
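A sketch of the check-then-convert flow on toy data (the country names come from the text; the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country name": ["Finland", "Taiwan Province of China", "Finland"],
    "age_ratio": ["60.1", "None", "59.8"],  # numbers stored as strings
})

# value_counts() exposes the literal 'None' string among numeric values.
print(df["age_ratio"].value_counts().to_dict())

# np.unique identifies which countries carry the 'None' rows.
bad_countries = np.unique(df.loc[df["age_ratio"] == "None", "Country name"])
df = df[~df["Country name"].isin(bad_countries)].copy()

# With the offending rows gone, the column converts cleanly to numeric.
df["age_ratio"] = pd.to_numeric(df["age_ratio"])
print(df["age_ratio"].dtype)  # float64
```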

Step 3: Remove Outliers

Before calling the function, we decided to get an overview of which variables contained outliers by sketching boxplots.

From the boxplots shown above, we could deduce that:

(i) 'Life Ladder' and 'Log GDP per capita' did not contain outliers;

(ii) 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect' and 'age_ratio' contained outliers.

Next, we used the interquartile rule to detect outliers. Since Python discerned 'year' as one of the numerical variables, we decided to drop the 'year' column at this stage.

We calculated Q1, Q3, and IQR for the df.

After obtaining the IQR of the variables, we computed the lower fence and upper fence to find outliers.

Outliers were detected, and we used the tilde (~) operator with any(axis = 1) to drop the rows containing outliers. df_out.shape told us how many rows and columns we were left with: 1715 rows and 11 columns.

Lastly, we inserted the 'year' column of sort_df back into our free-of-duplicates-and-outliers dataset, df_out. This completed the duplicate and outlier removal steps of data cleaning.
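The interquartile rule and the tilde-based row drop can be sketched on toy columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Life Ladder": [1.0, 2.0, 3.0, 4.0, 100.0],  # 100.0 is an obvious outlier
    "age_ratio": [10.0, 11.0, 12.0, 13.0, 14.0],
})
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# The tilde (~) negates the boolean mask; any(axis=1) flags a row
# if any of its columns falls outside the fences.
outlier_rows = ((df < lower) | (df > upper)).any(axis=1)
df_out = df[~outlier_rows]
print(df_out.shape)  # (4, 2)
```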

Step 4: Check Redundancy

Furthermore, we utilised Pearson correlation and a heatmap to indicate redundancy among variables. The result showed that the correlation coefficients among variables ranged from -0.73 to 0.81. After examining the variable relationships with high correlation coefficients, we determined that the variables involved, such as 'Life Ladder', 'Log GDP per capita', 'Healthy life expectancy at birth' and 'age_ratio', were each significant features in the dataset; therefore, we kept all variables.
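The redundancy check itself can be sketched as below on toy values; drawing the heatmap would then be a one-liner such as sns.heatmap(corr, annot=True), omitted here to keep the sketch dependency-light:

```python
import pandas as pd

df = pd.DataFrame({
    "Life Ladder": [4.0, 5.0, 6.0, 7.0],
    "Log GDP per capita": [8.0, 9.0, 10.0, 11.0],  # perfectly correlated toy column
    "Generosity": [0.2, 0.1, 0.15, 0.05],
})
corr = df.corr(method="pearson")

# Pairs with very high |r| are redundancy candidates.
print(round(corr.loc["Life Ladder", "Log GDP per capita"], 2))  # 1.0
```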

Step 5: Data Cleaning

We created another copy named 'cleaning_wip_df' to prevent the creation of dirty data in the original 'df_out' data frame.

Null Value Handling

We verified that our dataset contained null values.

There was a total of 683 null values in our dataset to be handled.

The total count of null data for each attribute was summarised as:

We rearranged our data by the 'Country name' and 'year' attributes in ascending order before proceeding to null value handling.

We utilised linear interpolation to estimate missing values for:

  1. Log GDP per capita
  2. Social support
  3. Healthy life expectancy at birth
  4. Freedom to make life choices
  5. Generosity
  6. Perceptions of corruption
  7. Positive affect
  8. Negative affect
Linear interpolation is the technique of determining the value of a function at an intermediate point when the values at two adjacent points are known.
In other words, it is the estimation of an unknown value that falls between two known values.
With 'inplace = True', the calculated values replaced the NaN entries in the data frame we were working with.

Linear interpolation was not applied to 'age_ratio' because it is country-dependent.
Thus, we decided to fill in the NaN 'age_ratio' values with the calculated mean for the respective country.
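Both fill strategies can be sketched on toy data (the project used 'inplace = True'; plain assignment, used here, is equivalent):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B"],
    "year": [2018, 2019, 2020, 2019, 2020],
    "Log GDP per capita": [10.0, np.nan, 11.0, 9.0, 9.4],
    "age_ratio": [50.0, np.nan, 52.0, 40.0, np.nan],
}).sort_values(["Country name", "year"])

# Indicator columns: linear interpolation between adjacent known values.
df["Log GDP per capita"] = df["Log GDP per capita"].interpolate()

# age_ratio is country-dependent, so fill each NaN with that country's mean.
df["age_ratio"] = df.groupby("Country name")["age_ratio"].transform(
    lambda s: s.fillna(s.mean())
)
print(df["Log GDP per capita"].tolist())  # [10.0, 10.5, 11.0, 9.0, 9.4]
print(df["age_ratio"].tolist())           # [50.0, 51.0, 52.0, 40.0, 40.0]
```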

Then, we rechecked the total number of null data again to ensure we successfully filled in estimated values above.

Based on the result, we could see there were 19 remaining rows with NaN values.

We managed to determine that these 19 rows of records belonged to Kosovo, Mali, Niger, Palestinian Territories, and Taiwan Province of China.

We decided to remove these listings from our dataset.

We used the describe() function to get a summary of our processed data.
There were 1696 rows of records left in our processed dataset.
There were 150 unique countries; Kazakhstan was among the countries with the highest frequency, with 16 records in total.
Our processed data ranged from year 2005 to year 2021.

Due to the skewed distribution of our pre-processed dataset, we decided to add the outliers back for our further analysis in Part 2 of the assignment.

We renamed 'age_ratio' to 'Age ratio'.

We shall save this dataset as clean_data.csv for Part 2.

Part 2: Project Seminar

The Project Seminar is the continuation of Part 1 which contains the following sections:-

  5. Exploratory Data Analysis
  6. Machine Learning Modelling
  7. Overall Result Evaluation & Interpretation
  8. Conclusion

5. Exploratory Data Analysis

Import additional packages required for EDA.

Read the pre-processed dataset and assign to 'happiness'.

EDA - Univariate, Bivariate and Multivariate Analysis

EDA analyses data sets to communicate meaning and surface knowledge hidden in the data, using statistical graphics and other data visualisation methods. Generally, EDA encompasses descriptive statistics and visualisation.

Our chosen methods for EDA were shown below:
Univariate analysis: Summary statistics and Histogram
Bivariate analysis: Correlation matrix and Pair plot
Multivariate analysis: Relplot

Univariate Analysis

In this section, we used Sweetviz library to conduct univariate analysis. Univariate analysis involves one variable at a time and it is the simplest form of analysis.

First, we installed the sweetviz library and ran it on our 'happiness' dataset. The library auto-generates the report in HTML format and saves it in our working directory.

The report 'analyze.html' is stored in our working directory by default. We used IPython to display the report here for a clear overview.

From 'analyze.html', we could see that the pre-processed dataset contained 2,052 rows and 12 columns, including 9 independent variables and 1 target variable. The histograms showed that these variables were unevenly distributed and skewed. Hence, we might need to transform our variables before the modelling phase.

Bivariate Analysis - Correlation Matrix

Bivariate analysis examines the relationship between two variables. Here, we calculated the correlation between different variables and plotted a heatmap using the Seaborn library.

Correlation Analysis: Findings

Our metrics for correlation were as follows:

The variable 'year' was to be ignored.

We had several interesting findings from the heatmap above that relate to our dependent variable, Healthy life expectancy at birth:

  1. There was a strong positive correlation between Log GDP per capita and Healthy life expectancy at birth (0.83).
  2. Several variables were also moderately positively correlated with Healthy life expectancy at birth, such as Life ladder (0.74), Social support (0.62), and Freedom to make life choices (0.39); Freedom to make life choices leaned more towards the weaker side.
  3. It was found that Age ratio, a variable from another dataset that we had integrated, had a strong negative correlation (-0.74) with our dependent variable.
  4. Perceptions of corruption (-0.32) and Positive affect (0.30) had a moderate correlation with life expectancy.
  5. It was interesting to note that Generosity and Negative affect showed almost no correlation with Healthy life expectancy at birth.
  6. Some bonus findings that displayed strong correlation were:
  7. Generally, we could observe that "unhealthy" variables showed negative correlation while neutral or positive variables showed positive correlation.

Diving deeper, we explored further the variables which possessed a strong or moderate correlation, either positive or negative, with Healthy life expectancy at birth:

1. Log GDP per Capita (0.83)

Strongly and positively correlated with Life ladder (0.78) and Social support (0.68). Strongly and negatively correlated with Age ratio (-0.74), which conforms to our usual expectation.


2. Life Ladder (0.74)

Moderately and positively correlated with Freedom to make life choices (0.53) and Positive affect (0.52). Interestingly, these variables, together with Social support, did not correlate with life expectancy as strongly as Life ladder did, with the exception of Log GDP per capita.


3. Age Ratio (-0.74)

Strongly and negatively correlated with Log GDP per capita.

Bivariate Analysis - Pair Plot

Using pairplot, we produced a descriptive visualisation similar to the heatmap, but in the form of scatter plots.

From the scatter plots above, we could once again confirm that 'year' had no relationship with any other variable, as it only served as the time index in the dataset.

Scatter Plot: Findings

Focusing on the 5th row, where Healthy life expectancy at birth acted as the dependent variable, we observed a few interesting patterns:

Next, we went into detail by plotting two variables coloured by a third variable, given the results from Correlation Analysis: Findings.

Multivariate Analysis

1. Log GDP per Capita vs Healthy Life Expectancy at Birth

2. Life Ladder vs Healthy Life Expectancy at Birth

3. Age Ratio vs Healthy Life Expectancy at Birth



6. Machine Learning Modelling

Model 1: Support Vector Machine (SVM)

Format the Data for Support Vector Machine

Step 1: Split the Data into Dependent and Independent Variables

We built a Support Vector Machine for regression.

We first split the data into 2 parts:

  1. The columns of data that we used to make predictions - X

  2. The column of data that we wanted to predict - y

We first dropped Healthy life expectancy at birth, Country name and year for our X variable because:

  1. Healthy life expectancy at birth is our dependent variable;

  2. Country name and year are categorical data; since the Support Vector Machine does not natively support categorical data, including them would confuse the algorithm (Josh Starmer, StatQuest).

Next, we created y that only contained the healthy life expectancy at birth column.

Step 2: One-Hot Encoding

The Support Vector Machine natively supports continuous data but not categorical data; thus, One-Hot Encoding needs to be performed to transform a categorical data column into multiple binary data columns.

Since all of our variables were continuous data, this step was skipped.

Step 3: Splitting the data into training and testing sets

Support Vector Machine for Regression with Sklearn

We proceeded with the prediction using Support Vector Machine.

Evaluation of the SVM Model on the Test Set

Model evaluation using plot:

Model evaluation using numerical scores:
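A minimal end-to-end sketch of this SVM regression workflow on synthetic data (the real X has the 9 indicator columns; the hyperparameters here are sklearn defaults, an assumption, not the project's tuned values):

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                            # toy feature matrix
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale using statistics fitted on the training set only.
scaler = StandardScaler().fit(X_train)
svr = SVR(kernel="rbf").fit(scaler.transform(X_train), y_train)
y_pred = svr.predict(scaler.transform(X_test))
print(round(r2_score(y_test, y_pred), 3))
```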

Model 2: K-Nearest Neighbour (KNN) Regression

KNN regression is a non-parametric regressor. It estimates the association between the independent variables and the target variable (in continuous data type) by averaging the targets of the entities in the same neighbourhood.

Similar to Model 1 (Format the Data for SVM), we:

Step 1: Split the data into independent and dependent variables;
Step 2: Split the data into training and test set;
Step 3: Transform the training and test set through scaling;
Step 4: Conduct fit and predict of modelling phase;
Step 5: Evaluation.

We removed the 'Country name' and 'year' columns to fit the data requirement of the KNN regressor: only continuous variables are allowed for effective modelling.

We split the dataset into training and test data such that training data = 70% whereas test data = 30%.

Since 'Healthy life expectancy at birth' is our target variable, we dropped it from x_train and assigned it to y_train.

Scaling is important when dealing with distance-based algorithms such as KNN, K-means and SVM. These algorithms are sensitive to the range of the data points. Hence, we utilised MinMaxScaler to transform the independent variables, x_train and x_test.

We did not need to scale y_train or y_test, as the model sets its parameter values based on the transformed x_train and x_test.

After finishing the preparation of our training and test data following the data requirements of the KNN Regressor, we imported the necessary packages for the process.

We built a for loop, fit the model on our training data, made prediction on the test data, and calculated the RMSE values.

We plotted an elbow curve to find out which K has the lowest RMSE score.

Since the elbow curve gave us only an approximate value of K (between 0.0 and 2.5) with the lowest RMSE score, we used Grid Search CV to capture the best parameter value from a given set of RMSE values.

To see whether the model works, we improvised 9 feature values to predict 'Healthy life expectancy at birth', which came out to 69.76 years.

Out of curiosity, we plotted scatter plots for x_train and x_test such that K = 2. The four scatter plots shown below portrayed how x_train, x_test (sample) and neighbour data points interacted with each other.
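Steps 1 to 5 above can be sketched end to end on synthetic data (toy features and K range; the real elbow and grid values differ):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 2))              # toy stand-in features
y = 70 + 10 * X[:, 0] - 5 * X[:, 1]         # toy 'life expectancy' target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1     # 70:30 split as in the text
)
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Elbow step: RMSE on the test set for each candidate K.
rmse = {}
for k in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train_s, y_train)
    rmse[k] = np.sqrt(mean_squared_error(y_test, knn.predict(X_test_s)))

# Grid search step: pick the best K by cross-validation on the training set.
grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": list(range(1, 11))},
    scoring="neg_mean_squared_error",
)
grid.fit(X_train_s, y_train)
print(grid.best_params_)
```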

Evaluation of KNN Regressor

To align with the evaluation process of the 4 other models, we ran another set of simple KNN Regressor commands to obtain the Mean Absolute Error, Mean Squared Error and R-squared scores.

Model 3: Multiple Linear Regression, MLR

Multiple Linear Regression (MLR) is the most common form of linear regression analysis. As a predictive analysis, MLR is used to explain the relationship between one continuous dependent variable and two or more independent variables.

Steps to predict the life expectancy using MLR:

  1. Begin with identifying independent and dependent variables
  2. Next, segregate them into training and testing sets
  3. Transform the training and testing datasets via scaling method
  4. Conduct fit and modelling prediction
  5. Perform evaluation

Step 1: Identifying independent and dependent variables

In order to perform predictions on life expectancy, our dependent variable is Healthy life expectancy at birth. Meanwhile, we removed the Country name & year columns for better modelling results, and the remaining variables were our independent variables.

Step 2: Split dataset into training and testing data

We split the data into training and testing sets with a ratio of 4:1.

Step 3: Transform the training and testing datasets via scaling method

To standardise our attribute values, scaling of our data is necessary.

Step 4: Conduct fit and modelling prediction

Step 5: Perform evaluation
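The five MLR steps can be sketched end to end on synthetic data (noise-free toy data, hence the near-perfect score; the real scores were lower):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(250, 4))                   # stand-ins for the indicators
y = 65 + X @ np.array([3.0, -2.0, 1.0, 0.0])    # noise-free toy target

# Step 2: 4:1 split, i.e. test_size = 0.2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Step 3: scale, fitting the scaler on the training data only.
scaler = StandardScaler().fit(X_train)

# Steps 4-5: fit, predict, evaluate.
mlr = LinearRegression().fit(scaler.transform(X_train), y_train)
score = r2_score(y_test, mlr.predict(scaler.transform(X_test)))
print(round(score, 2))  # 1.0 on noise-free data
```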

Model 4: Random Forest

Random forest is one of the supervised machine learning methods applicable to both classification and regression problems.
Our random forest was made up of multiple decision trees, and it combined the outputs of those trees to reach a single result.

Our Random Forest implementation steps are summarized as:

Step 1: The first 5 rows of the dataset were selected and displayed for us to understand the current state of the dataset.
Step 2: The training dataset was prepared by shortlisting the required features.
Step 3: The dataset was split into independent and dependent variables.
Step 4: The dataset was split into training and testing sets.
Step 5: The training and test sets underwent data transformation through scaling.
Step 6: Fitting and prediction were conducted during the modelling phase.
Step 7: The parameters of the random forest were refined.
Step 8: The results were displayed for evaluation.

We prepared the happiness dataset with the required features for training.

Then, the dataset was divided into training and testing sets with an 80:20 allocation.

Not all attributes in our dataset were on the same scale; for instance:

At the same time, other attributes had values in the range of ones. Thus, we used Scikit-Learn's StandardScaler to scale our data.

Now that our dataset was scaled, a random forest model was trained to solve our regression problem.

The metrics used to evaluate the algorithm were:
(i) mean absolute error --> a version of RMSE/MSE that is less sensitive to outliers;
(ii) root mean squared error --> the average error made by the model in predicting the outcome for an observation;
(iii) mean squared error --> the average squared difference between the observed actual outcome values and the values predicted by the model;
(iv) r2_score (linear model) --> an indication of how much of the variation is explained by the independent variables.
The evaluated performance of the first random forest attempt was summarised as below:

Random Hyperparameter Grid
RandomizedSearchCV was adopted to increase the performance of the random forest.
To use RandomizedSearchCV, we created a parameter grid to sample from during fitting:

Altogether, there were a few thousand settings! However, the benefit of a random search was that we were not trying every combination, but randomly selecting from a wide range of values.

Random Search Training
A random search was initiated and fitted like any Scikit-Learn model:

The best parameters detected from fitting the random search were as below:

Evaluate Random Search
The performance of the base model and the fine-tuned model was evaluated to help us decide which model to select.

Performance was lower for the fine-tuned model; fine-tuning does not always lead to a better result.
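A condensed sketch of the base-vs-tuned comparison (the illustrative parameter grid here is far smaller than the real one, and the data is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1]               # toy non-linear target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3  # 80:20 allocation as in the text
)

# Base model with default parameters.
base = RandomForestRegressor(random_state=3).fit(X_train, y_train)

# Parameter grid to sample from during fitting.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=3),
    param_grid,
    n_iter=5,            # sample 5 settings instead of trying all 27
    random_state=3,
)
search.fit(X_train, y_train)

# Compare base and tuned models on held-out data; tuning does not always win.
print(round(base.score(X_test, y_test), 2),
      round(search.best_estimator_.score(X_test, y_test), 2))
```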

Model 5: Multi-Layer Perceptron (MLP)

A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation).

Our multilayer perceptron has 3 hidden layers with 100, 65 and 32 neurons respectively.

Its remaining hyperparameters are:

Activation function: ReLU

Weight optimisation: adam

L2 regularisation term alpha: 0.001

Learning rate: 0.001

Epochs: 200

Epsilon: 1e-8 (adam's default)
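The configuration above maps onto scikit-learn's MLPRegressor as follows (assuming sklearn was the implementation used, which the text does not state explicitly); the toy fit simply shows the configured network trains:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(100, 65, 32),  # 3 hidden layers
    activation="relu",
    solver="adam",                     # weight optimisation
    alpha=0.001,                       # L2 regularisation term
    learning_rate_init=0.001,
    max_iter=200,                      # epochs
    epsilon=1e-8,                      # adam's numerical-stability term
    random_state=0,
)

# Fit on toy data; 200 iterations may raise a ConvergenceWarning, which is fine here.
X = np.random.default_rng(4).normal(size=(100, 3))
y = X[:, 0]
mlp.fit(X, y)
print(mlp.n_layers_)  # 5: input + 3 hidden + output
```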

7. Overall Result Evaluation & Interpretation

In this section, we used R-squared, Mean Absolute Error (MAE) and Mean Squared Error (MSE) to evaluate our 5 models.

R-squared: the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model.
MAE: the difference between the original and predicted values, obtained by averaging the absolute differences over the dataset.
MSE: the difference between the original and predicted values, obtained by averaging the squared differences over the dataset.
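All three metrics can be computed with sklearn on a toy example:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [70.0, 65.0, 60.0]   # toy observed life expectancies
y_pred = [69.0, 66.0, 61.0]   # toy model predictions

print(mean_absolute_error(y_true, y_pred))  # 1.0
print(mean_squared_error(y_true, y_pred))   # 1.0
print(round(r2_score(y_true, y_pred), 2))   # 0.94
```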

Bar Chart for mean absolute error

Bar Chart for mean squared error

Bar Chart for R2 score

Bar Chart for each metric's ranking

Bar Chart for Feature Importances using Random Forest

We wanted a sneak peek at which attribute had the most influence when predicting life expectancy. Thus, we used the random forest model to find out.

Age ratio, Log GDP per capita and Life ladder have the most influence on life expectancy.
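The feature-importance ranking can be sketched on a toy dataset in which one feature is deliberately irrelevant (column names borrowed from the text; data synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = pd.DataFrame({
    "Age ratio": rng.normal(size=300),
    "Log GDP per capita": rng.normal(size=300),
    "Generosity": rng.normal(size=300),   # deliberately irrelevant to y
})
y = -2.0 * X["Age ratio"] + 3.0 * X["Log GDP per capita"]

rf = RandomForestRegressor(random_state=5).fit(X, y)

# feature_importances_ sums to 1; higher means more influence in the forest.
ranked = pd.Series(rf.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
print(ranked.index.tolist())
```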

8. Conclusion

EDA

In summary, we ranked below the variables that have correlations with our dependent variable, in descending order.

Log GDP per capita, Life ladder and Age ratio influence Healthy life expectancy at birth the most whereas Negative affect and Generosity have weak correlation with the target variable.

ML

Our evaluation metric values for the 5 ML models are summarised below:

Models MAE MSE R-Squared
Support Vector Machine 3.33 20.14 0.63
KNN Regressor 1.56 6.31 0.89
Multiple Linear Regression 2.53 12.46 0.76
Multi-Layer Perceptron 2.91 14.47 0.74
Random Forest 1.56 5.51 0.90

The ranking of the ML model by types of evaluation metric were as below:

Models MAE MSE R-Squared
Support Vector Machine 5th 5th 5th
KNN Regressor 2nd 2nd 2nd
Multiple Linear Regression 3rd 3rd 3rd
Multi-Layer Perceptron 4th 4th 4th
Random Forest 1st 1st 1st

Both tables indicate that Random Forest worked best on the dataset, followed by KNN Regressor, Multiple Linear Regression (MLR), Multi-Layer Perceptron and Support Vector Machine (SVM). Since KNN Regressor and Random Forest shared very similar MAE and R-squared values, we concluded that both of them performed very well.

Since our dataset was not normally distributed, non-parametric models such as KNN Regressor and Random Forest (SVM being the exception) worked better than parametric models such as MLP and MLR. A non-parametric model does not assume a function for the dataset, thus offering a more flexible approach to fitting it (Steorts, n.d.).

In addition, the KNN Regressor works better on a dataset when the number of training samples (m) is larger than the number of attributes (n), i.e. m > n (Varghese, 2018). In contrast, SVM works best when m < n, where there are fewer training samples but more columns. When we looked at our dataset, we noticed that our m was much larger than n. Therefore, SVM only ranked 5th among the models during model evaluation.


Steorts, R. (n.d.). Comparison of linear regression with K-nearest neighbors.

Varghese, D. (2018). Comparative study on classic machine learning algorithms. Towards Data Science.

Summary of experience: We feel extremely grateful to have accomplished this project exploring Happiness Indicators vs Life Expectancy. Throughout the journey, we gained a deeper understanding of data analytics processes, learnt data cleaning methods and applied different modelling techniques to our chosen dataset. Most importantly, we met a group of friends who are also keen on delving into the world of data science, and worked together as a team.

There are some limitations and constraints in this project; for instance, we could have explored more models and evaluation metrics for the dataset. We hope to address this in future projects.

END.